Note: To organize our responses, questions in the book chapters appear in bold text. We’ve done the same for each section of the Tidy Tuesday assignment prompt.
Libraries used in Ch. 4 exercises
library(gapminder)
library(tidyverse)
library(socviz)
library(ggrepel)
Explore gapminder data set
head(gapminder %>%
arrange(desc(gdpPercap)))
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Kuwait Asia 1957 58.0 212846 113523.
## 2 Kuwait Asia 1972 67.7 841934 109348.
## 3 Kuwait Asia 1952 55.6 160000 108382.
## 4 Kuwait Asia 1962 60.5 358266 95458.
## 5 Kuwait Asia 1967 64.6 575003 80895.
## 6 Kuwait Asia 1977 69.3 1140357 59265.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = pop))
p + geom_line(aes(group = country)) + facet_wrap(~year)
p + facet_wrap(~year)
Healy_Ch.4_4.8.1 <- p + geom_line(aes(group = country)) + facet_wrap(~country)
ggsave("Healy_4.8.1.pdf", plot=Healy_Ch.4_4.8.1, height=8, width=12)
ggsave("Healy_4.8.1.pdf", plot=Healy_Ch.4_4.8.1, height=20, width=40)
# in addition, we generated a separate plot below by filtering to selected European countries, analyzing the same variables for a more manageable visualization
euro_gpd_pop <- gapminder %>%
filter(continent == "Europe", country %in% c("France", "Italy", "Germany", "Turkey")) %>%
ggplot(aes(x = gdpPercap, y = pop)) + geom_line() + facet_wrap(~country, ncol = 4)
euro_gpd_pop + geom_smooth(size = 1.1, method = "loess", se = FALSE) + labs(x = "GDP per Capita",
y = "Population",
title = "Population vs. GDP per Capita in Selected European Countries")
Age_Childs <-ggplot(gss_sm,mapping = aes(x = age, y = childs))
Age_Childs +
geom_point()+
geom_smooth() +
facet_grid(sex ~ race)
Age_Childs +
geom_point()+
geom_smooth() +
facet_grid(~sex + race)
facet_grid(sex ~ race) creates facets breaking out the data by sex (in the rows) and race (in the columns). facet_grid(~ sex + race) creates facets breaking out the data by sex and race using columns only, combining each unique sex+race value into its own column.
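The column-only layout from facet_grid(~ sex + race) is built on the combination of the two factors; base R's interaction() shows how those combined levels are formed (a small illustration with made-up values, not the gss_sm data):

```r
# Combining two factors into one, as facet_grid(~ sex + race) does for its columns
sex  <- factor(c("Male", "Female", "Male"))
race <- factor(c("White", "Black", "Black"))
combined <- interaction(sex, race, sep = " + ")
levels(combined)  # one level per sex/race pairing
```

Each level of the combined factor corresponds to one column of the facet layout.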
Age_Childs +
geom_point() +
geom_smooth() +
facet_wrap(~sex + race)
# store object
life_expectancy_versus_per_capita_GDP <- ggplot(data = gapminder, mapping = aes(x = lifeExp, y = gdpPercap))
# plot
life_expectancy_versus_per_capita_GDP + geom_bin2d(bins=c(40,60))
percbelowpoverty_percollege <- ggplot(data = midwest, mapping = aes(x = percbelowpoverty, y=percollege))
percbelowpoverty_percollege + geom_density_2d() + geom_point() +
labs(x = "% Below Poverty Line", y = "% Attended College", caption =
"Note: Demographic information of midwest counties\n from 2000 US census")
Libraries used for Chapter 5 exercises
library(gapminder)
library(tidyverse)
library(socviz)
The following code chunks were developed as practice before answering the ‘Where to Next’ questions in the chapter.
rel_by_region <- gss_sm %>%
group_by(bigregion, religion) %>%
summarize(N = n()) %>%
mutate(freq = N / sum(N),
pct = round((freq*100), 0))
rel_by_region %>% group_by(bigregion) %>% summarize(total = sum(pct))
## # A tibble: 4 × 2
## bigregion total
## <fct> <dbl>
## 1 Northeast 100
## 2 Midwest 101
## 3 South 100
## 4 West 101
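The Midwest and West totals of 101 (and totals of 99 elsewhere in these exercises) come from rounding each percentage to a whole number before summing. A minimal base-R illustration with made-up counts:

```r
# Rounded percentages need not sum to exactly 100
N <- c(7, 7, 7)                    # three equal-sized groups
pct <- round(N / sum(N) * 100, 0)  # 33 33 33
sum(pct)                           # 99, not 100
```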
p <- ggplot(rel_by_region, aes(x = bigregion, y = pct, fill = religion))
p + geom_col(position='dodge2') +
labs(x = "Region",y = "Percent", fill = "Religion") +
theme(legend.position = "top")
p + geom_col(position = "dodge2") +
labs(x = NULL, y = "Percent", fill = "Religion") +
guides(fill = FALSE) +
coord_flip() +
facet_grid(~ bigregion)
organdata %>% select(1:6) %>% sample_n(size = 10)
## # A tibble: 10 × 6
## country year donors pop pop_dens gdp
## <chr> <date> <dbl> <int> <dbl> <int>
## 1 Denmark 1991-01-01 11.7 5154 12.0 19126
## 2 Finland 1993-01-01 19.6 5066 1.50 17082
## 3 Canada NA NA NA NA NA
## 4 Netherlands 1993-01-01 15.4 15290 36.8 19856
## 5 United States 1998-01-01 21 275854 2.86 31612
## 6 Canada 1992-01-01 12.6 28377 0.285 19590
## 7 Italy 1992-01-01 5.8 56859 18.9 18883
## 8 Denmark NA NA 5141 11.9 18285
## 9 Netherlands 2000-01-01 12.6 15926 38.3 26873
## 10 France 1996-01-01 15.1 58026 10.5 21990
ggplot(organdata)+
aes(x=year, y=donors)+
geom_line(aes(group=country))+
facet_wrap(~country)
p <- ggplot(data = organdata,
mapping = aes(x = reorder(country, donors, na.rm=TRUE),
y = donors, fill = world))
p + geom_boxplot() + labs(x=NULL) +
coord_flip() + theme(legend.position = "top")
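In the boxplot above, reorder(country, donors, na.rm = TRUE) sorts the country factor by each country's mean donor rate before plotting. A small base-R sketch of reorder() with made-up values:

```r
# reorder() re-levels a factor by a summary (mean, by default) of a second variable
x    <- factor(c("a", "b", "c"))
vals <- c(3, 1, 2)
levels(reorder(x, vals))  # "b" "c" "a" -- ordered by increasing value
```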
by_country <- organdata %>% group_by(consent_law, country) %>%
summarize_if(is.numeric, list(mean = mean, sd = sd), na.rm = TRUE) %>%
ungroup()
p <- ggplot(data = by_country,
mapping = aes(x = donors_mean,
y = reorder(country, donors_mean)))
p + geom_point(size=3) +
facet_wrap(~ consent_law,scales = "free_y", ncol = 1) +
labs(x= "Donor Procurement Rate",
y= "")
library(ggrepel)
p <- ggplot(data = by_country, mapping = aes(x = roads_mean,
y = donors_mean))
p + geom_text_repel(mapping = aes(label = country))
elections_historic %>% select(2:7)
## # A tibble: 49 × 6
## year winner win_party ec_pct popular_pct popular_margin
## <int> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1824 John Quincy Adams D.-R. 0.322 0.309 -0.104
## 2 1828 Andrew Jackson Dem. 0.682 0.559 0.122
## 3 1832 Andrew Jackson Dem. 0.766 0.547 0.178
## 4 1836 Martin Van Buren Dem. 0.578 0.508 0.142
## 5 1840 William Henry Harrison Whig 0.796 0.529 0.0605
## 6 1844 James Polk Dem. 0.618 0.495 0.0145
## 7 1848 Zachary Taylor Whig 0.562 0.473 0.0479
## 8 1852 Franklin Pierce Dem. 0.858 0.508 0.0695
## 9 1856 James Buchanan Dem. 0.588 0.453 0.122
## 10 1860 Abraham Lincoln Rep. 0.594 0.396 0.101
## # … with 39 more rows
p_title <- "Presidential Elections: Popular & Electoral College Margins"
p_subtitle <- "1824-2016"
p_caption <- "Data for 2016 are provisional."
x_label <- "Winner's share of Popular Vote"
y_label <- "Winner's share of Electoral College Votes"
p <- ggplot(elections_historic, aes(x = popular_pct, y = ec_pct,
label = winner_label))
p + geom_hline(yintercept = 0.5, size = 1.4, color = "gray80") +
geom_vline(xintercept = 0.5, size = 1.4, color = "gray80") +
geom_point() +
geom_text_repel() +
scale_x_continuous(labels = scales::percent) +
scale_y_continuous(labels = scales::percent) +
labs(x = x_label, y = y_label, title = p_title, subtitle = p_subtitle,
caption = p_caption)
# subset the elections data to years after 1988, as the next election was 1992
after_1992 <- elections_historic %>% filter(year > 1988)
# pre-setting labels as objects so we can use later
p_title <- "Presidential Elections: Popular & Electoral College Margins"
p_subtitle <- "1824-2016"
p_caption <- "Data for 2016 are provisional."
x_label <- "Winner's share of Popular Vote"
y_label <- "Winner's share of Electoral College Votes"
p <- ggplot(data=elections_historic, aes(x = popular_pct, y = ec_pct,
label = winner_label, color=win_party, shape=win_party))
p + geom_hline(yintercept = 0.5, size = 1.4, color = "gray80") +
geom_vline(xintercept = 0.5, size = 1.4, color = "gray80") +
geom_point() + geom_smooth()+
geom_text_repel(data=subset(after_1992)) +
scale_x_continuous(labels = scales::percent) +
scale_y_continuous(labels = scales::percent) + labs(x = x_label, y = y_label, title = p_title, subtitle = p_subtitle,caption = p_caption)
after_1992 <- elections_historic %>% filter(year > 1988)
p_title <- "Presidential Elections: Popular & Electoral College Margins"
p_subtitle <- "1824-2016"
p_caption <- "Data for 2016 are provisional."
x_label <- "Winner's share of Popular Vote"
y_label <- "Winner's share of Electoral College Votes"
p <- ggplot(data=elections_historic, aes(x = popular_pct, y = ec_pct,
label = winner_label, color=win_party, shape=win_party))
p + geom_hline(yintercept = 0.5, size = 1.4, color = "gray80") +
geom_vline(xintercept = 0.5, size = 1.4, color = "gray80") +
geom_point() +
geom_text_repel(data=subset(after_1992)) +
scale_x_continuous(labels = scales::percent) +
scale_y_continuous(labels = scales::percent) +
labs(x = x_label, y = y_label, title = p_title, subtitle = p_subtitle,
caption = p_caption)
p_title <- "Presidential Elections: Popular & Electoral College Margins"
p_subtitle <- "1824-2016"
p_caption <- "Data for 2016 are provisional."
x_label <- "Winner's share of Popular Vote"
y_label <- "Winner's share of Electoral College Votes"
p <- ggplot(data=elections_historic, aes(x = popular_pct, y = ec_pct,
label = winner_label, color=win_party, shape=win_party))
p + geom_hline(yintercept = 0.5, size = 1.4, color = "gray80") +
geom_vline(xintercept = 0.5, size = 1.4, color = "gray80") +
geom_point() +
geom_text_repel(data=subset(after_1992)) +
scale_x_continuous(labels = scales::percent) +
scale_y_continuous(labels = scales::percent) +
labs(x = x_label, y = y_label, title = p_title, subtitle = p_subtitle,
caption = p_caption)
p_title <- "Presidential Elections: Popular & Electoral College Margins"
p_subtitle <- "1824-2016"
p_caption <- "Data for 2016 are provisional."
x_label <- "Winner's share of Popular Vote"
y_label <- "Winner's share of Electoral College Votes"
p <- ggplot(elections_historic, aes(x = popular_pct, y = ec_pct,
label = winner_label))
p + geom_hline(yintercept = 0.5, size = 1.4, color = "gray80") +
geom_vline(xintercept = 0.5, size = 1.4, color = "gray80") +
geom_point() +
geom_text_repel() +
scale_x_continuous(labels = scales::percent) +
scale_y_continuous(labels = scales::percent) +
labs(x = x_label, y = y_label, title = p_title, subtitle = p_subtitle,
caption = p_caption)+
annotate("rect", xmin = 0.3, xmax=0.5, ymin =0.5, ymax=1, fill="red", alpha=0.1)
library(ggthemes) # used below for the minimal theme
# first, using mutate to calculate avg. life expectancy
gapminder %>%
filter(year == 2007) %>%
select(continent, lifeExp) %>%
group_by(continent) %>%
mutate(continent_mean = mean(lifeExp)) %>%
ggplot(aes(x = continent_mean, y = reorder(continent, continent_mean))) + geom_point() +
labs(title = "Mean Life Expectancy in 2007 by Continent", x = "Life Expectancy (Years)",
y = "Continent") + theme_minimal()
# same plot as above but using summarize instead of mutate to calculate means
gapminder %>%
filter(year == 2007) %>%
select(continent, lifeExp) %>% group_by(continent) %>%
summarize(exp_mean = mean(lifeExp)) %>%
ggplot(aes(x = exp_mean, y = reorder(continent, exp_mean))) + geom_point() +
labs(title = "Mean Life Expectancy in 2007 by Continent", x = "Life Expectancy (Years)",
y = "Continent") + theme_minimal()
# Healy example:
gss_sm %>% group_by(race, degree) %>%
summarize(N = n()) %>%
mutate(pct = round(N / sum(N)*100, 0))
## # A tibble: 18 × 4
## # Groups: race [3]
## race degree N pct
## <fct> <fct> <int> <dbl>
## 1 White Lt High School 197 9
## 2 White High School 1057 50
## 3 White Junior College 166 8
## 4 White Bachelor 426 20
## 5 White Graduate 250 12
## 6 White <NA> 4 0
## 7 Black Lt High School 60 12
## 8 Black High School 292 60
## 9 Black Junior College 33 7
## 10 Black Bachelor 71 14
## 11 Black Graduate 31 6
## 12 Black <NA> 3 1
## 13 Other Lt High School 71 26
## 14 Other High School 112 40
## 15 Other Junior College 17 6
## 16 Other Bachelor 39 14
## 17 Other Graduate 37 13
## 18 Other <NA> 1 0
gss_sm %>% group_by(race, degree) %>% summarize(N = n()) %>%
mutate(pct = round(N/sum(N) * 100, 0)) %>% summarize(total = sum(pct))
## `summarise()` has grouped output by 'race'. You can override using the `.groups`
## argument.
## # A tibble: 3 × 2
## race total
## <fct> <dbl>
## 1 White 99
## 2 Black 100
## 3 Other 99
As expected, this sums to 100 within each race, with a bit of rounding error.
gss_sm %>% group_by(region, degree) %>% summarize(N = n()) %>%
mutate(pct = round(N/sum(N) * 100, 0))
## `summarise()` has grouped output by 'region'. You can override using the
## `.groups` argument.
## # A tibble: 49 × 4
## # Groups: region [9]
## region degree N pct
## <fct> <fct> <int> <dbl>
## 1 New England Lt High School 13 7
## 2 New England High School 77 44
## 3 New England Junior College 19 11
## 4 New England Bachelor 39 22
## 5 New England Graduate 27 15
## 6 Middle Atlantic Lt High School 25 8
## 7 Middle Atlantic High School 160 51
## 8 Middle Atlantic Junior College 21 7
## 9 Middle Atlantic Bachelor 59 19
## 10 Middle Atlantic Graduate 47 15
## # … with 39 more rows
gss_sm %>% group_by(region, degree) %>% summarize(N = n()) %>%
mutate(pct = round(N/sum(N) * 100, 0))%>% summarize(total = sum(pct))
## `summarise()` has grouped output by 'region'. You can override using the
## `.groups` argument.
## # A tibble: 9 × 2
## region total
## <fct> <dbl>
## 1 New England 99
## 2 Middle Atlantic 100
## 3 E. Nor. Central 100
## 4 W. Nor. Central 100
## 5 South Atlantic 100
## 6 E. Sou. Central 100
## 7 W. Sou. Central 100
## 8 Mountain 100
## 9 Pacific 100
As expected, this sums to 100 within each region, with a bit of rounding error.
gss_sm %>% group_by(degree) %>%
summarize(childsmean=mean(childs, na.rm=TRUE),
childsmedian=median(childs, na.rm=TRUE))
## # A tibble: 6 × 3
## degree childsmean childsmedian
## <fct> <dbl> <dbl>
## 1 Lt High School 2.81 3
## 2 High School 1.86 2
## 3 Junior College 1.77 2
## 4 Bachelor 1.45 1
## 5 Graduate 1.52 2
## 6 <NA> 3.6 4
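The mean and median diverge in the table above because the number of children is a skewed count variable. A small base-R illustration with made-up counts:

```r
# A few large families pull the mean above the median
childs <- c(0, 1, 1, 2, 2, 2, 8)
mean(childs)    # about 2.29
median(childs)  # 2
```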
# boxplot with life expectancy over time (years)
ggplot(gapminder, aes(x=year, y=lifeExp, group=continent)) +
geom_boxplot()
# same as above, but grouping by year within each continent to show life expectancy distributions across years, faceted by continent
gapminder %>%
group_by(continent, year) %>%
ggplot(aes(x = year, y = lifeExp, group = year)) + geom_boxplot() + facet_wrap(~continent)
#?geom_boxplot() # view geom_boxplot() documentation
# notched boxplot
gapminder %>%
filter(continent == "Europe") %>%
ggplot(aes(x = year, y = lifeExp, group = year)) + geom_boxplot(notch = TRUE)
# adjust varwidth options
gapminder %>%
filter(continent == "Europe") %>%
ggplot(aes(x = year, y = lifeExp, group = year)) + geom_boxplot(varwidth = TRUE)
# violin plot 1
ggplot(gapminder, aes(x=year, y=lifeExp, group= year, fill=continent)) +
geom_violin() + facet_wrap(~continent)
# violin plot 2
gapminder %>%
filter(year %in% c("1957", "2007"), !continent == "Oceania") %>%
ggplot(aes(x = year, y = lifeExp, group = year, fill = year)) + geom_violin() + facet_wrap(~continent) +
theme(legend.position = "none") + labs(title = "Life Expectancy in 1957 and 2007 by Continent",
x = "Year", y = "Life Expectancy")
Load relevant packages
library("ggplot2")
library("GGally")
library("ggExtra")
library("ggalluvial")
library("plotly")
library("cowplot")
library("data.table")
First, we load random6.csv, a random subset of the pisa data set. Additionally, before answering the questions in the chapter, we perform the data-cleaning tasks that were applied to the data set in Ch. 4 to ready the data for visualization.
pisa <- fread("random6.csv", na.strings = "")
#pisa <- fread("region6.csv",na.strings = "")
# check data
#head(pisa) # not printed in the HTML file, as it returns a large output
Recode variables
bin.to.num <- function(x) {
if (is.na(x)) NA
else if (x =="Yes") 1L
else if (x == "No") 0L
}
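bin.to.num works on one element at a time, which is why it is applied with sapply below. A vectorized base-R equivalent (a sketch using the same Yes/No coding) would avoid the element-wise loop:

```r
# Vectorized Yes/No -> 1/0 recoding; anything else (including NA) stays NA
bin.to.num.vec <- function(x) {
  ifelse(x == "Yes", 1L, ifelse(x == "No", 0L, NA_integer_))
}
bin.to.num.vec(c("Yes", "No", NA))  # 1 0 NA
```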
#Create and recode variables
pisa[, `:=`
(female = ifelse(ST004D01T == "Female", 1, 0),
sex = ST004D01T,
# At my house we have ...
desk = sapply(ST011Q01TA, bin.to.num),
own.room = sapply(ST011Q02TA, bin.to.num),
quiet.study = sapply(ST011Q03TA, bin.to.num),
computer = sapply(ST011Q04TA, bin.to.num),
software = sapply(ST011Q05TA, bin.to.num),
internet = sapply(ST011Q06TA, bin.to.num),
lit = sapply(ST011Q07TA, bin.to.num),
poetry = sapply(ST011Q08TA, bin.to.num),
art = sapply(ST011Q09TA, bin.to.num),
book.sch = sapply(ST011Q10TA, bin.to.num),
tech.book = sapply(ST011Q11TA, bin.to.num),
dict = sapply(ST011Q12TA, bin.to.num),
art.book = sapply(ST011Q16NA, bin.to.num))]
pisa[, `:=`
(math = rowMeans(pisa[, c(paste0("PV", 1:10, "MATH"))], na.rm = TRUE),reading = rowMeans(pisa[, c(paste0("PV", 1:10, "READ"))], na.rm = TRUE), science = rowMeans(pisa[, c(paste0("PV", 1:10, "SCIE"))], na.rm = TRUE))]
Filter big data based on a list of countries (called “country”)
country <- c("United States", "Canada", "Mexico", "B-S-J-G (China)", "Japan",
"Korea", "Germany", "Italy", "France", "Brazil", "Colombia", "Uruguay",
"Australia", "New Zealand", "Jordan", "Israel", "Lebanon")
dat <- pisa[CNT %in% country,
.(CNT, OECD, CNTSTUID, W_FSTUWT, sex, female,
ST001D01T, computer, software, internet,
ST011Q05TA, ST071Q02NA, ST071Q01NA, ST123Q02NA,
ST082Q01NA, ST119Q01NA, ST119Q05NA, ANXTEST,
COOPERATE, BELONG, EMOSUPS, HOMESCH, ENTUSE,
ICTHOME, ICTSCH, WEALTH, PARED, TMINS, ESCS,
TEACHSUP, TDTEACH, IBTEACH, SCIEEFF,
math, reading, science)
]
Create additional variables for data visualization
dat <- dat[, `:=` (
# New grade variable
grade = (as.numeric(sapply(ST001D01T, function(x) {
if(x=="Grade 7") "7"
else if (x=="Grade 8") "8"
else if (x=="Grade 9") "9"
else if (x=="Grade 10") "10"
else if (x=="Grade 11") "11"
else if (x=="Grade 12") "12"
else if (x=="Grade 13") NA_character_
else if (x=="Ungraded") NA_character_}))),
# Total learning time as hours
learning = round(TMINS/60, 0),
# Regions for selected countries
Region = (sapply(CNT, function(x) {
if(x %in% c("Canada", "United States", "Mexico")) "N. America"
else if (x %in% c("Colombia", "Brazil", "Uruguay")) "S. America"
else if (x %in% c("Japan", "B-S-J-G (China)", "Korea")) "Asia"
else if (x %in% c("Germany", "Italy", "France")) "Europe"
else if (x %in% c("Australia", "New Zealand")) "Australia"
else if (x %in% c("Israel", "Jordan", "Lebanon")) "Middle-East"
}))
)]
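The country-to-region recoding above can also be written without sapply by indexing a named lookup vector; a base-R sketch covering a few of the countries:

```r
# Named lookup vector: country name -> region
region_lookup <- c("Canada" = "N. America", "United States" = "N. America",
                   "Japan" = "Asia", "Germany" = "Europe")
cnt <- c("Japan", "Canada", "Germany")
unname(region_lookup[cnt])  # "Asia" "N. America" "Europe"
```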
# N count for the final dataset
dat[,.N]
## [1] 35847
# Preview final data
head(dat)
## CNT OECD CNTSTUID W_FSTUWT sex female ST001D01T computer software
## 1: Germany Yes 27605365 94.71886 Female 1 Grade 9 1 1
## 2: Germany Yes 27601858 94.71886 Female 1 Grade 9 1 1
## 3: Germany Yes 27602008 94.71886 Female 1 Grade 9 1 0
## 4: Germany Yes 27601867 101.08900 Male 0 Grade 9 1 1
## 5: Germany Yes 27601457 101.08900 Male 0 Grade 9 1 1
## 6: Germany Yes 27609865 101.08900 Male 0 Grade 9 1 1
## internet ST011Q05TA ST071Q02NA ST071Q01NA ST123Q02NA ST082Q01NA
## 1: 1 Yes 2 1 Agree Agree
## 2: 1 Yes 6 4 Strongly agree Agree
## 3: 1 No 1 1 Strongly agree Agree
## 4: 1 Yes 1 1 Strongly agree Strongly agree
## 5: 1 Yes 1 0 Strongly agree Strongly agree
## 6: 1 Yes 7 1 Agree Strongly agree
## ST119Q01NA ST119Q05NA ANXTEST COOPERATE BELONG EMOSUPS
## 1: Strongly disagree Strongly disagree -0.0345 -0.2882 0.0171 -1.5266
## 2: Disagree Disagree -0.5792 1.0205 0.6642 1.0991
## 3: Strongly agree Agree -0.4331 2.2879 1.5043 1.0991
## 4: Agree Disagree 1.8135 0.9675 0.0048 1.0991
## 5: Agree Agree 0.6007 0.7898 -1.7499 1.0991
## 6: Strongly agree Agree -0.0582 0.6615 0.9357 0.0250
## HOMESCH ENTUSE ICTHOME ICTSCH WEALTH PARED TMINS ESCS TEACHSUP TDTEACH
## 1: -0.5712 -0.4097 NA NA 1.0228 15 1350 0.3306 -0.8040 -0.8671
## 2: -0.8913 -0.7527 NA NA -0.2249 13 1620 -0.4779 0.4879 -0.6849
## 3: 0.3373 0.4811 NA NA 1.3528 18 NA 0.9317 1.4475 0.5252
## 4: 0.0631 -0.1772 NA NA 2.3511 13 1620 0.6614 1.4475 0.9477
## 5: 0.1053 -0.0269 NA NA 0.3228 10 1620 -0.1385 0.0838 2.0781
## 6: -0.2290 0.3594 NA NA 0.2763 15 1620 0.2320 -0.3478 0.4505
## IBTEACH SCIEEFF math reading science grade learning Region
## 1: -0.6081 NA 487.5435 529.1613 502.8598 9 22 Europe
## 2: -0.1573 -2.8303 464.7986 516.5613 487.9087 9 27 Europe
## 3: 0.9882 -2.0922 595.8506 657.5961 579.7019 9 NA Europe
## 4: 2.4937 -0.4044 577.6913 546.5353 553.7394 9 27 Europe
## 5: 2.3047 0.1155 519.0156 471.8701 467.8655 9 27 Europe
## 6: 0.5711 0.3155 541.2141 540.0169 538.2691 9 27 Europe
ggplot(data = dat,
mapping = aes(x = reorder(CNT, math), y = math, fill = Region)) +
geom_point(aes(color = Region)) +
labs(x=NULL, y="Math Scores") +
coord_flip() +
geom_hline(yintercept = 490, linetype="dashed", color = "red", size = 1) +
theme_bw()
ggplot(data = dat,
mapping = aes(x = reorder(CNT, math), y = math, fill = Region)) +
geom_violin(aes(color = Region))+
labs(x=NULL, y="Math Scores") +
coord_flip() +
geom_hline(yintercept = 490, linetype="dashed", color = "red", size = 1) +
theme_bw()
ggplot(data = dat,
mapping = aes(x = science, fill = Region)) +
geom_histogram(alpha = 0.5, bins = 50) +
labs(x = "Science Scores", y = "Count",
title = "Science Scores by Gender and Region") +
facet_grid(. ~ sex) +
theme_bw()
# Random sample of 500 students from each region
dat_small <- dat[,.SD[sample(.N, min(500,.N))], by = Region]
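The data.table expression above samples up to 500 rows within each region via .SD. The same per-group sampling can be sketched in base R (toy data with hypothetical column names, capped at 5 rows per group):

```r
# Sample at most 5 rows per group with split() + sample()
set.seed(42)
toy <- data.frame(Region = rep(c("A", "B"), c(10, 3)), score = rnorm(13))
keep <- unlist(lapply(split(seq_len(nrow(toy)), toy$Region),
                      function(idx) sample(idx, min(5, length(idx)))))
toy_small <- toy[keep, ]
table(toy_small$Region)  # at most 5 rows per region
```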
ggpairs(data = dat_small,
mapping = aes(color = Region),
columns = c("reading", "science", "math", "sex"),
upper = list(continuous = wrap("cor", size = 2.5))
)
Visualizing multiple variables
Interpretation:
ggplot(data = dat_small,
mapping = aes(x = ESCS, y = math)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Socio-Economic Status", y = "Math Scores") +
theme_bw() +
theme(legend.title = element_blank()) +
facet_grid(sex ~ Region)
Below we create the summary data set (dat_alluvial2) for this plot.
dat_alluvial2 <- dat[,
.(Freq = .N),
by = c("Region", "sex", "ST123Q02NA")
][,
ST119Q01NA := as.factor(ifelse(ST123Q02NA == "", "Missing", ST123Q02NA))]
levels(dat_alluvial2$ST123Q02NA) <- c("Strongly disagree", "Disagree", "Agree",
"Strongly agree", "Missing")
head(dat_alluvial2)
## Region sex ST123Q02NA Freq ST119Q01NA
## 1: Europe Female Agree 822 Agree
## 2: Europe Female Strongly agree 1775 Strongly agree
## 3: Europe Male Strongly agree 1645 Strongly agree
## 4: Europe Male Agree 910 Agree
## 5: Europe Female Disagree 183 Disagree
## 6: Europe Male <NA> 546 <NA>
dat_alluvial2 %>% arrange(desc(Freq))
## Region sex ST123Q02NA Freq ST119Q01NA
## 1: Middle-East Female <NA> 2451 <NA>
## 2: N. America Female Strongly agree 2312 Strongly agree
## 3: N. America Male Strongly agree 2113 Strongly agree
## 4: Middle-East Male <NA> 2095 <NA>
## 5: S. America Female Strongly agree 1883 Strongly agree
## 6: Europe Female Strongly agree 1775 Strongly agree
## 7: Asia Male Agree 1742 Agree
## 8: S. America Male Strongly agree 1716 Strongly agree
## 9: Europe Male Strongly agree 1645 Strongly agree
## 10: Asia Female Strongly agree 1582 Strongly agree
## 11: Asia Female Agree 1467 Agree
## 12: Australia Female Strongly agree 1446 Strongly agree
## 13: Australia Male Strongly agree 1372 Strongly agree
## 14: N. America Male Agree 1223 Agree
## 15: Asia Male Strongly agree 1192 Strongly agree
## 16: N. America Female Agree 1115 Agree
## 17: S. America Female Agree 976 Agree
## 18: Europe Male Agree 910 Agree
## 19: S. America Male Agree 892 Agree
## 20: Europe Female Agree 822 Agree
## 21: Australia Male Agree 709 Agree
## 22: Australia Female Agree 649 Agree
## 23: Europe Male <NA> 546 <NA>
## 24: Europe Female <NA> 376 <NA>
## 25: Asia Male Disagree 256 Disagree
## 26: N. America Male Strongly disagree 238 Strongly disagree
## 27: N. America Female Strongly disagree 229 Strongly disagree
## 28: Asia Female Disagree 198 Disagree
## 29: Europe Female Disagree 183 Disagree
## 30: Europe Male Disagree 175 Disagree
## 31: N. America Male Disagree 139 Disagree
## 32: S. America Male <NA> 131 <NA>
## 33: N. America Female Disagree 119 Disagree
## 34: Asia Male Strongly disagree 111 Strongly disagree
## 35: S. America Female Disagree 108 Disagree
## 36: S. America Female <NA> 108 <NA>
## 37: S. America Male Disagree 90 Disagree
## 38: Australia Male <NA> 82 <NA>
## 39: S. America Female Strongly disagree 79 Strongly disagree
## 40: S. America Male Strongly disagree 79 Strongly disagree
## 41: Australia Male Strongly disagree 62 Strongly disagree
## 42: Australia Female Strongly disagree 55 Strongly disagree
## 43: Australia Female <NA> 54 <NA>
## 44: Australia Female Disagree 53 Disagree
## 45: N. America Male <NA> 52 <NA>
## 46: Asia Female Strongly disagree 46 Strongly disagree
## 47: Europe Female Strongly disagree 41 Strongly disagree
## 48: Australia Male Disagree 38 Disagree
## 49: Asia Male <NA> 37 <NA>
## 50: Europe Male Strongly disagree 31 Strongly disagree
## 51: N. America Female <NA> 28 <NA>
## 52: Asia Female <NA> 16 <NA>
## Region sex ST123Q02NA Freq ST119Q01NA
# cleaned column names
dat_alluvial2 <- dat_alluvial2 %>%
rename(parent_support = ST123Q02NA) %>%
mutate(sex = as.factor(sex),
Region = as.factor(Region),
parent_support = as.factor(parent_support)) %>%
filter(!is.na(parent_support))
dat_alluvial2
## Region sex parent_support Freq ST119Q01NA
## 1: Europe Female Agree 822 Agree
## 2: Europe Female Strongly agree 1775 Strongly agree
## 3: Europe Male Strongly agree 1645 Strongly agree
## 4: Europe Male Agree 910 Agree
## 5: Europe Female Disagree 183 Disagree
## 6: Europe Male Disagree 175 Disagree
## 7: Europe Female Strongly disagree 41 Strongly disagree
## 8: Europe Male Strongly disagree 31 Strongly disagree
## 9: Asia Female Strongly agree 1582 Strongly agree
## 10: Asia Female Agree 1467 Agree
## 11: Asia Male Agree 1742 Agree
## 12: Asia Male Disagree 256 Disagree
## 13: Asia Male Strongly agree 1192 Strongly agree
## 14: Asia Female Strongly disagree 46 Strongly disagree
## 15: Asia Female Disagree 198 Disagree
## 16: Asia Male Strongly disagree 111 Strongly disagree
## 17: N. America Male Agree 1223 Agree
## 18: N. America Male Strongly agree 2113 Strongly agree
## 19: N. America Female Agree 1115 Agree
## 20: N. America Male Strongly disagree 238 Strongly disagree
## 21: N. America Female Strongly agree 2312 Strongly agree
## 22: N. America Female Disagree 119 Disagree
## 23: N. America Male Disagree 139 Disagree
## 24: N. America Female Strongly disagree 229 Strongly disagree
## 25: Australia Female Strongly agree 1446 Strongly agree
## 26: Australia Male Agree 709 Agree
## 27: Australia Female Agree 649 Agree
## 28: Australia Female Disagree 53 Disagree
## 29: Australia Male Strongly agree 1372 Strongly agree
## 30: Australia Male Disagree 38 Disagree
## 31: Australia Male Strongly disagree 62 Strongly disagree
## 32: Australia Female Strongly disagree 55 Strongly disagree
## 33: S. America Female Agree 976 Agree
## 34: S. America Female Strongly agree 1883 Strongly agree
## 35: S. America Male Strongly agree 1716 Strongly agree
## 36: S. America Male Agree 892 Agree
## 37: S. America Female Strongly disagree 79 Strongly disagree
## 38: S. America Male Disagree 90 Disagree
## 39: S. America Female Disagree 108 Disagree
## 40: S. America Male Strongly disagree 79 Strongly disagree
## Region sex parent_support Freq ST119Q01NA
Next, we use this data set to draw the alluvial plot.
# StatStratum <- StatStratum
ggplot(data = dat_alluvial2,
aes(axis1 = Region, axis2 = parent_support, y = Freq, label = after_stat(stratum))) +
scale_x_discrete(limits = c("Region", "Parents supporting\nachievement"),
expand = c(.1, .05)) +
geom_alluvium(aes(fill = sex)) +
geom_stratum() +
geom_text(stat = "stratum", label.strata = TRUE) +
labs(x = "Demographics", y = "Frequency", fill = "Gender") +
theme_bw()
The alluvial plot reveals several trends. First, parental support varies within every region: each region contains students at every level of parental support. Second, comparing across regions, some regions show a higher average level of parental support than others; the region at the bottom of the plot, for example, shows substantially lower parental support than the region at the top. Finally, although parental support varies within each region, the proportion of students at each support level is consistent between genders within a region.
Science-by-Region Histogram:
ggplot(data = dat,
mapping = aes(x = science, fill = Region)) +
geom_histogram(alpha = 0.5, bins = 50) +
labs(x = "Science Scores", y = "Count",
title = "Science Scores by Gender and Region") +
facet_grid(. ~ sex) +
theme_bw()
Next, we present a science-by-Region Density Plot made interactive using plotly with alpha = 0.5:
p5 <- ggplot(data = dat,
mapping = aes(x = science, fill = Region)) +
geom_density(alpha = 0.5) +
labs(x = "Science Scores", y = "Count",
title = "Science Scores by Gender and Region") +
facet_grid(. ~ sex) +
theme_bw()
ggplotly(p5)
Science-by-Region Density Plot made interactive using plotly with alpha = 0.8:
p6 <- ggplot(data = dat,
mapping = aes(x = science, fill = Region)) +
geom_density(alpha = 0.8) +
labs(x = "Science Scores", y = "Count",
title = "Science Scores by Gender and Region") +
facet_grid(. ~ sex) +
theme_bw()
ggplotly(p6)
ggplot(data = dat,
mapping = aes(x = reading, fill = Region)) +
geom_histogram(alpha = 0.5, bins = 50) +
labs(x = "Reading Scores", y = "Count",
title = "Reading Scores by Gender and Region") +
facet_grid(. ~ sex) +
theme_bw()
p8 <- ggplot(data = dat,
mapping = aes(x = reading, fill = Region)) +
geom_density(alpha = 0.5) +
labs(x = "Reading Scores", y = "Count",
title = "Reading Scores by Gender and Region") +
facet_grid(. ~ sex) +
theme_bw()
ggplotly(p8)
In a Tidy Tuesday screencast originally presented in February 2021, David Robinson explored an open-source data set from the National Science Foundation (NSF) that contains the number of PhDs awarded each year by academic discipline. For this assignment, we have revised the visualizations that David Robinson created during the screencast to show the number of PhDs awarded by field over time.
The data set that was originally presented in the Tidy Tuesday screencast can be found at the following link: https://github.com/rfordatascience/tidytuesday/blob/master/data/2019/2019-02-19/phd_by_field.csv
For each academic discipline, the data set contains multiple levels of distinctions between academic area of study. Specifically, the data set has categories for broad field, major, and minor disciplines. Broad fields are umbrella disciplines with multiple sub-disciplines nested within. For example, in the following code chunk we see the various disciplines that are nested within the Health Sciences field, which are in turn rolled-up under the Life Sciences broad field.
library(tidyverse)
library(readr)
urlfile="https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-02-19/phd_by_field.csv"
phd_raw <- read_csv(url(urlfile)) # store a raw version in case needed later after data cleaning
## Parsed with column specification:
## cols(
## broad_field = col_character(),
## major_field = col_character(),
## field = col_character(),
## year = col_double(),
## n_phds = col_double()
## )
# store as data frame
phd_df <- as.data.frame(phd_raw)
# example of 10 rows showing the various levels of field stored in the data set
phd_df %>%
group_by(broad_field, major_field, field) %>%
filter(major_field == "Health sciences") %>% head(10)
## # A tibble: 10 × 5
## # Groups: broad_field, major_field, field [10]
## broad_field major_field field year n_phds
## <chr> <chr> <chr> <dbl> <dbl>
## 1 Life sciences Health sciences Environmental health 2008 69
## 2 Life sciences Health sciences Environmental toxicologyc 2008 43
## 3 Life sciences Health sciences Gerontology (health sciences) 2008 NA
## 4 Life sciences Health sciences Health and behavior 2008 NA
## 5 Life sciences Health sciences Health services research 2008 NA
## 6 Life sciences Health sciences Health systems administration 2008 70
## 7 Life sciences Health sciences Kinesiology, exercise sciencef 2008 212
## 8 Life sciences Health sciences Medical physics, radiological sci… 2008 NA
## 9 Life sciences Health sciences Nursing science 2008 504
## 10 Life sciences Health sciences Oral biology, oral pathology 2008 NA
We propose two main steps to improve David Robinson’s line graph of PhDs awarded by field over time. First, to enhance the line graph’s ability to convey differences in degrees awarded by field over time, we apply principles of Gestalt psychology to the visualization. In particular, we apply five principles, each of which is discussed in greater detail below in our section on the application of Gestalt theory. Second, to enhance the visualization’s ability to convey a meaningful story, we limit the date range presented and focus on one academic discipline rather than all 7 broad fields. Our process for accomplishing each of these goals is detailed in section E below.
I. Data Cleaning
# rename field to 'minor field'
phd_clean <- phd_df %>%
rename(minor_field = field)
# store field as a factor (rather than a character string) so it can be used for
# grouping, and rename to match David Robinson's naming convention from his
# Tidy Tuesday screencast
phd_clean <- phd_clean %>%
mutate(field = as.factor(broad_field),
year = as.factor(year))
# for our line plot, we will visualize the average degrees awarded each year by field
phds_by_year <- phd_clean %>%
filter(!is.na(n_phds)) %>%
group_by(year, field) %>%
summarize(num_degrees = mean(n_phds))
library(ggtext) # used to successfully render HTML code for color words in title
phds_by_year %>%
mutate(highlight = ifelse(field == "Education", "Education", "Other Fields")) %>%
ggplot(aes(x = year, y = num_degrees, group = field, color = highlight, size = highlight)) +
geom_line() +
scale_color_manual(values = c("#69b3a2", "grey50")) +
scale_size_manual(values = c(1.5, 0.2)) +
labs(
title = "<strong>The number of PhDs in <span style='color:#69b3a2;'>education</span> has decreased in recent years</strong>",
x = "Year", y = "Degrees Awarded") +
# annotate() draws a single label rather than one label per data row
annotate("label", x = 4, y = 185, label = "Education PhDs in decline", size = 4, color = "#69b3a2") +
theme(
legend.position = "none",
plot.title = element_markdown(size = 14))
As mentioned in our plan above, we applied five principles for enhancing data visualizations drawn from Schwabish (2021), whose work builds on insights from Gestalt theory and visual perception research. The following paragraphs state each principle that we used and discuss our process of applying it to enhance the original visualization.
The first two principles that we leveraged from Schwabish’s (2021) writing on Gestalt and visual perception research were to ‘show the data’ and ‘start with gray’. By showing the data, Schwabish recommends that analysts should, when possible, limit the salience of data that is not relevant to the analysis in focus. For our Tidy Tuesday revision, while we enjoyed seeing Robinson’s trend lines for all academic disciplines, we felt that the chart was bordering on what would be considered a ‘spaghetti chart’; that is, we felt that Robinson’s visualization of seven simultaneous trend lines placed too much cognitive demand on the audience. Thus, in applying the ‘show the data’ principle, we chose to highlight just one trend line, PhDs awarded in education. To do so, we isolated the education trend line through the application of the second principle of ‘starting with gray’. For this principle, Schwabish suggests that by making all chart elements gray at the outset of a project, the analyst becomes more strategic in their application of color to the chart. This was true for our Tidy Tuesday revision. By starting with gray, we forced ourselves to restrict our use of color in order to make the education trend line stand out.
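The gray-first pattern described above can be sketched in a few lines. This is a minimal illustration using the gapminder data loaded earlier; the highlighted country and hex colors are arbitrary choices for demonstration, not part of our actual revision.

```r
library(tidyverse)
library(gapminder)

gapminder %>%
  # flag the one series we want the reader to focus on
  mutate(highlight = if_else(country == "Japan", "Japan", "Other")) %>%
  ggplot(aes(x = year, y = lifeExp, group = country, color = highlight)) +
  geom_line() +
  # every series starts gray; color is reintroduced only for the series in focus
  scale_color_manual(values = c("Japan" = "#69b3a2", "Other" = "grey80")) +
  theme(legend.position = "none")
```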
The next principle from Schwabish (2021) that we applied was to integrate chart graphics and text. Specifically, building on our selective use of color, we formatted the word ‘education’ in our chart title to match the color of the education trend line. This principle of integrating chart and text elements connects to Gestalt theorists’ work on similarity and pre-attentive processing. Gestalt theories on similarity draw on the general idea that humans tend to perceive elements that share similar characteristics as being related to one another. Building off of similarity, pre-attentive processing describes unconscious cognitive processes that occur in relation to patterns in the visual field. Analysts can leverage the Gestalt notion of similarity, say, to prime an audience to pre-attentively perceive connections between otherwise disparate elements in a visual field. For our project, our selective use of color for the chart title and trend line was a deliberate attempt to prime our audience to unconsciously perceive the connection between these features of our visualization.
Having integrated chart and text elements, we were then ready to reduce clutter, a fourth principle of Schwabish’s. Schwabish (2021) suggests that the unnecessary use of visualization elements can be distracting to readers. We identified that while the original visualization’s legend was necessary for identifying the academic disciplines in Robinson’s plot, we no longer needed a legend after applying the aforementioned principles. Finally, our last enhancement to the original Tidy Tuesday visualization was to write a title like a headline. Schwabish recommends that, as often as possible, titles for data visualizations be written in a way that underscores what a reader should learn from a visualization. Titles that are focused on conveying a particular message minimize the processing demands placed on a reader during the encoding phase of visual processing. Moreover, Schwabish cites research suggesting that such ‘active titles’ are associated with greater recall of the information conveyed through a data visualization. Thus, for our project, we crafted a title that spoke directly to the narrative that we wished to highlight: that the number of education PhDs awarded in the U.S. has declined over the past decade.
We experienced two issues during our revision of the visualization that we selected. Both issues occurred when we were applying Schwabish’s (2021) recommendation to integrate graph and text elements. First, we experienced difficulty coloring the single word ‘education’ in our title to match the color of the trend line. We quickly learned that in order to change the color of specific title words, we needed the ggtext library installed. However, our biggest hang-up in selectively formatting the title’s coloring was learning how to format the HTML code. Specifically, the ggtext library allows users to write HTML code to tell R which word needs to be formatted and in what manner. After trial and error, we were able to correctly color the word ‘education’ to match the trend line.
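The essential ggtext pattern is an HTML `<span>` in the title string, rendered by `element_markdown()`. Here is a minimal, self-contained sketch of that pattern; the data set and colors are illustrative, not part of our actual chart.

```r
library(ggplot2)
library(ggtext)   # provides element_markdown() for rendering HTML/markdown in theme text

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "#69b3a2") +
  # inline HTML colors just the highlighted phrase, not the whole title
  labs(title = "Heavier cars get <span style='color:#69b3a2;'>fewer miles per gallon</span>") +
  theme(plot.title = element_markdown())
```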
The second challenge that we ran into was positioning the chart annotation at the correct coordinates in the graph. In our early attempts, it was not immediately clear why we could not place the annotation in the correct spot: the annotation wasn’t showing up when we set the annotation’s x-value to a year. We realized, however, that since we had previously converted our year variable to a factor, we could not position the annotation by the numeric value of the year. Instead, with a factor, ggplot expected us to refer to the level index of year (e.g., 1 through 10) for the x-value. Once we made this connection to the data type of our year variable, we were able to place the annotation in the desired spot on the chart.
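The factor-axis behavior described above can be reproduced with a small example. In this sketch (the data frame and values are made up for illustration), positions on a discrete axis are the integer level indices, so the annotation's x-value is computed from the level rather than passed as a year.

```r
library(ggplot2)

# toy data with year stored as a factor, mirroring our cleaned data
df <- data.frame(year = factor(2008:2017), n = 1:10)

ggplot(df, aes(x = year, y = n, group = 1)) +
  geom_line() +
  # on a discrete axis, positions are level indices; look up the index of the
  # desired year instead of passing the year itself
  annotate("label", x = which(levels(df$year) == "2011"), y = 8,
           label = "annotation here")
```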